Indexing huge genome sequences for solving various problems.

Authors

K Sadakane

T Shibuya

Abstract

As genome sequence databases grow, indexing the sequences for fast queries becomes increasingly important. Suffix trees and suffix arrays are used for simple queries. However, they are not suitable for complicated queries over huge amounts of sequence data because the indices are stored on disk, which has slow access speed. We propose storing the indices in memory in a compressed form, using the compressed suffix array. It stores the suffix array compactly at the cost of a theoretically small slowdown in access speed. We show experimentally that the overhead of using the compressed suffix array is reasonable in practice. We also propose an approximate string matching algorithm suited to the compressed suffix array. Furthermore, we have constructed the compressed suffix array of the whole human genome. Because its size is about 2 GB, a workstation can hold the search index for the whole data set in main memory, which will speed up the solution of various problems in genome informatics.
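To make the indexing idea concrete, here is a minimal sketch of exact pattern search over a plain, uncompressed suffix array. It is not the compressed suffix array described in the abstract, and it is not the authors' code. The function names `build_suffix_array` and `find_occurrences` and the toy sequence are assumptions chosen for illustration.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction by sorting suffixes; fine for a demo, far too slow
    and memory-hungry for a real genome.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return the sorted start positions of pattern in text using binary search on sa."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with pattern.
    return sorted(sa[start:lo])


genome = "ACGTACGTGACG"  # hypothetical toy sequence
sa = build_suffix_array(genome)
hits = find_occurrences(genome, sa, "ACG")
print(hits)  # [0, 4, 9]
```

Each query needs only a logarithmic number of probes into the array, which is why it matters whether the array sits on disk or in memory. As a rough estimate, assuming about 3 billion bases and 4-byte entries, an uncompressed suffix array of the human genome would need around 12 GB, compared with the roughly 2 GB reported above for the compressed form.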

Similar resources

A Resource-frugal Probabilistic Dictionary and Applications in (Meta)Genomics

Genomic and metagenomic fields, generating huge sets of short genomic sequences, brought their own share of high performance problems. To extract relevant pieces of information from the huge data sets generated by current sequencing techniques, one must rely on extremely scalable methods and solutions. Indexing billions of objects is a task considered too expensive while being a fundamental nee...

An Efficient Algorithm for Finding Similar Short Substrings from Large Scale String Data

Finding similar substrings/substructures is a central task in analyzing huge amounts of string data such as genome sequences, web documents, log data, etc. In the sense of complexity theory, the existence of polynomial time algorithms for such problems is usually trivial since the number of substrings is bounded by the square of their lengths. However, straightforward algorithms do not work for...

Cloning and determination of biochemical properties of protective and broadly conserved vaccine antigens from the genome of extraintestinal pathogenic Escherichia coli into pET28a vector

Urinary tract infections are among the most common infectious diseases and cause significant health problems worldwide. The term refers to an infection in any part of the renal system. Uropathogenic Escherichia coli, Proteus mirabilis, and Klebsiella are the main organisms involved in these infections. After identifying same protective and conserved virulence ...

A Fast Divide-and-Conquer Algorithm for Indexing Human Genome Sequences

Since the release of human genome sequences, one of the most important research issues is about indexing the genome sequences, and the suffix tree is most widely adopted for that purpose. The traditional suffix tree construction algorithms have severe performance degradation due to the memory bottleneck problem. The recent disk-based algorithms also have limited performance improvement due to r...

May 16th 2012 Combinatorial Pattern Matching

1) Title: Algorithms for genome assembly Authors: Leena Salmela, Veli Mäkinen, Niko Välimäki, Johannes Ylinen, Esko Ukkonen Presenter: Leena Salmela Description: Current DNA sequencing technologies can produce huge amounts of short reads. The de novo genome assembly problem is to infer the genome of an organism based on these reads. We have studied several problems related to the assembly probl...

Journal title:

Genome informatics. International Conference on Genome Informatics

Volume 12 (issue not listed)

Pages: not listed

Publication date: 2001
